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Abstract 

Background. Mathematical models provide epidemiologists with powerful tools for quantita- 
tively assessing the effectiveness of control methods and uncovering underlying mechanisms 
of observed infection data. One of the key parameters is the transmission rate. The trans- 
mission rate of many acute infectious diseases varies significantly in time, but the underlying 
mechanisms are usually uncertain. They may include seasonal changes in the environment, 
contact rate, immune system response, etc. The transmission rate has been thought impossi- 
ble to measure directly. We derive an algorithm to recover the time-dependent transmission 
rate directly from infection data. 

Methodology/Principal Findings. The algorithm is derived from the complete and explicit 
solution of a mathematical inverse problem for SIR-type transmission models. We illustrate 
the algorithm with historic UK measles data and discover that the two dominant spectral 
peaks of the transmission rate have 2- and 1/3-year periods, respectively. In contrast, previ- 
ous measles transmission models assumed a one-year periodic transmission rate function to 
account for mixing of children in school. 

Conclusions/Significance. The main objective of this work is to provide a new algorithm to 
recover the transmission rate function from collectable infection data. Our algorithm also 
yields that almost any infection profile can be perfectly fitted by an SIR-type model with 
variable transmissibility. This clearly illustrates the danger of overfitting an SIR transmission 
model. 
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*The UK measles data and birth data was obtained from [23(] and [2J]. The authors very much appreciate the efforts 
of David Earn and Ben Bolker for maintaining these public databases of infectious disease data. 
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INTRODUCTION 



The SIR epidemic model was proposed by Kermack and McKendrik 15] and was extensively 
developed by Anderson and May [l] . One of the key parameters in these transmission models is the 
transmission rate, which is the product of the average number of contacts a susceptible individual 
has with infected individuals per unit time and the average probability of transmission during each 
effective contact. In Section 3.4.9 of Anderson and May [1], the authors state that "... the direct 
measurement of the transmission rate is essentially impossible for most infections. But if we wish 
to predict the changes wrought by public health programmes, we need to know the transmission rate 

The transmission rate of many accute infectious diseases varies significantly in time and fre- 
quently exhibits significant seasonal dependence Q, (22 ] : influenza, pneumococcus, and rotavirus 
cases peak in winter; RSV and measles cases peak in spring; and polio cases peak in summer. 
Measles outbreaks in several UK cities (pre-vaccine) followed a two-year cycle, and many inves- 
tigators have attempted to explain this two-year cycle with mathematical models. Since measles 
cases, as well as cases of several other childhood viral diseases such as pertussis, seem to strongly 
correlate with school terms and breaks, previous modelers have assumed that the transmission rate 

n 



can be represented by a simple trigonometric [11] or Haar [1J] function which has one- year period. 
Our work calls this assumption into question. 

Virtually all previous investigators have estimated the transmission rate using the formula 

where, for instance, S(k),I(k) are the proportions of susceptible and infected individuals during 
week k ^Q]- This formula requires knowledge of S(k), which is usually very difficult to estimate. 
Our new algorithm obviates this difficulty. 

We first consider the simplest SIR transmission model and allow the transmission rate to be a 
time-dependent function, i.e., there is a positive function f3(t) such that 

S'(t) = -f3(t)S(t)I(t), (2) 
I'(t) = p(t)S(t)I(t)-vI(t), (3) 
R'(t) = ul(t), (4) 

such that S(t),I(t), and R(t) are the proportions of susceptible, infected, and removed individuals 
of time t. We begin by asking the mathematical question: 
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Given smooth infection data on time interval [0, T] and recovery rate v > 0, can one always 
choose a transmission rate function f3(t) such that the SIR epidemic model always perfectly fits the 
smooth data with the given u? 

Mathematicians call this an inverse problem. We prove that this is always possible subject to 
a mild restriction on the infection data and and we provide an explicit formula for the solution. 
The construction also illustrates the danger of overfitting a transmission model where one can 
choose the time-dependent transmission rate. 

However, in practice, infection data are always discrete, not continuous. We show that one 
can robustly estimate f3(t) by first smoothly interpolating the data with a spline or trigonometric 
function and then applying the formula to smooth data. 

The usual transmission rate recovery method based on (1) can be viewed as a discretization 
of (3). Our new method not only avoids this crude discretization but also uses the additional 
information contained in (1). 

We extend our recovery algorithm to more general classes of transmission models including the 
SEIR epidemic model with vital rates, and we illustrate the extended algorithm using UK measles 
data during 1948-1966. Our recovered transmission rate f3{t) has two dominant spectral peaks: at 
frequencies 1/2 and 3 per year, respectively, with the former having the largest power. There is 
no significant peak corresponding to an annual cycle, as hypothesized by all previous authors. As 
a strong consistency check, our recovered f3(t) exhibits minima in July-August during the summer 
school vacation: the months with the fewest number of notifications, and maxima during the 
winter-spring period January-June: when there were the largest number of notifications. 

II. RESULTS 

We rigorously derive a mathematical algorithm for recovering the time-dependent transmission 
rate from infection data. This algorithm is applied to two simulated data sets representing two 
characteristic "types" of infectious diseases. We then illustrate the algorithm using UK measles 
data from 1948-1966. 

A. Recovery algorithm 

Our recovery algorithm is based on a mathematical solution of an inverse problem, which is 
expanded in Section [IV Al To recover the transmission rate f3(t) from an infection data set, the 
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recovery algorithm has four steps and requires two conditions. 

Step 1. Smoothly interpolate the infection data with spline or trigonometric functions to generate 
a smooth f(t). Check condition 1: f'(t)/f(t) > —v, where v is the removal rate. 

f"(t)f(t) - fit) 2 

Step 2. Compute the function pit) = — w „, , , tttt- Condition 1 prevents a zero denomina- 

tor in pit). 

Step 3. Choose /3(0) and compute the integral P(t) = / p(r)dr. Check condition 2: /3(0) < 
1 / J e p ^ f(s)ds , where T is the time length of the infection data. Alternatively, choose 



(3(0) sufficiently small to satisfy condition 2. 



Step 4. Apply the formula (3(t) = 1 
the given interval [0 T]. 



e -P{t) /m _ e -P{t) f e P{s) f{s)ds 

Jo 



to compute (3(t) on 



Condition 1 is equivalent to d(ln f (t)) / dt > —v, i.e., the time series of infection data cannot 
decay too fast at any time. This is a mild condition that most data sets satisfy. If a data set 
does not satisfy this condition, we propose a scaling trick in Section IIIII to be able to apply the 
algorithm. 

In Section IIV Bl we present extensions of the basic recovery algorithm to several popular exten- 
sions of the SIR model, including the SEIR model with variable vital rates. Our algorithm can be 
extended to virtually any such compartment model. 

B. Recovering the transmission rate from simulated data 

We first illustrate the recovery algorithm using two simulated data sets. The functions f(t) and 
g(t) are the fractions of the infected population for two characteristic "types" of infectious diseases. 
The first data set simulates an infectious disease with periodic outbreaks, as observed in measles 



(before mass vaccination) and cholera , lljj. The periodic function f(t) = 10" 5 [1.4 + cos(1.5t)] 
represents the continuous infection data, and Figure 1(a) contains plots of both fit) (solid) and 
its associated transmission rate function f3(t) (dashed). 

The second data set simulates an infectious disease with periodic outbreaks that decays in time, 
as observed in influenza [20]. The periodic function g(t) = 1CP 5 [1.1 + sin(i)] exp(— O.lt) represents 
the continuous infection data, and Figure 2(a) contains plots of both g(t) (solid) and its associated 
transmission rate function j3(t) (dashed). 
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We extract discrete data from functions f(t) and g{t) by sampling them at equi-spaced inter- 
vals (see the small black squares in Figure 1(a) and Figure 2(a)). To each discrete time series, we 
apply two well-known interpolation algorithms (trigonometric approximation and spline approxi- 



mation) [16|, |21[. Figure 1(b) and Figure 2(b) contain plots of f3(t) obtained from the two smooth 
interpolations together with the recovery algorithm. Both interpolation schemes yield excellent 
approximations of j3{t) in both examples. 

Many simulations show that the recovery algorithm is robust with respect to white noise up to 
10% of the data mean, as well as the number of sample points. 

The peaks of (3{t) in Figure 1 are increasing over time. This is a manifestation of the choice 
of /3(0) and v. Different values of /3(0) and v may lead to the peaks increasing, decreasing, or 
non-monotone over time. We will see in Section III CI that for historic UK measles, the important 
global characteristics of (3(t) are robust with respect to /3(0). 

C. Recovering the transmission rate from UK measles data 

Previous studies {(], 13] employed the SEIR model with vital rates to explore the epidemic and 
endemic behaviors of measles infections, using the notification data in To compare our new 
recovery technique with previous measles studies, we extend our recovery algorithm to the SEIR 
model with variable vital rates, (see Section [IV B 5ft . 

We use the parameter values from Anderson and May , OPCS et al. Q: v = 52/year= 
52/12/month (where 1/v is the removal period), a = 52/year= 52/12/month (where 1/a is the 
latent period), and 5 = 0.06/year= 0.06/12/month (death rate). 

Public databases, such as Bolker's measles data archive [24] and the International Infectious 
Disease Data Archive [23|, contain the quarterly reported historical UK birth rates from 1948—1956. 
During 1948 — 1956 the births show large annual variations (see Figure 3(c)) with a strong 1/year 
frequency component (see Figure 3(d)). Some years these variations approach 20%. We follow 
custom and include actual births in our model. Since neither database contains the detailed birth 
rates (quarterly or monthly) from 1957—1966, this requires us to restrict to the period 1948 — 1956. 

We aggregate the weekly incidence data to obtain monthly data (see Figure EJ^a)). The aggrega- 
tion filters the data and reduces noise. There is a well-documented and well-studied underreporting 
bias in the UK measles data [5l, roll, thus we incorporate the standard correction factor of 40% in 

nn 

our simulations. We assume a constant correction factor as previous studies [5J, |6J], although the 
recovered f3(t) may be influenced by subtleties of the underreporting. In addition, we smoothly 
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interpolate the quarterly birth data to obtain monthly birth data. We then smoothly interpolate 
the monthly data (infection and birth) and compute /3(t) using the extended recovery algorithm. 

The authors [j], [rJ used a school term-based step function for the transmission rate, which 
attains the values 846/year during January 1-6 and 1408/year during January 7-31. We choose the 
mean value of their transmission rate function in January, /3(0) = 1299 as the value of /3(0) in our 
simulations. 

In Figure EJe) we plot the annual transmission rate f3(t) recovered from our algorithm. As a 
strong consistency check, note that annual minima occur in July- August during the summer school 
vacation period and annual maxima occur between January and June. This corresponds precisely 
with infection data and the assumptions of other modelers. 

In Figure [3{f) we plot the modulus of the Fourier transform of f3(t) and observe that the 
dominant spectral peaks have 2 and 1/3-year periods, with the former having higher power. It is 
interesting to compare these periods to the dominant spectral peak of 1-year period for the birth 
rate (Figure E^d)) and 2 and 1-year periods for the number of infections (Figure E^b)). All previous 
measles transmission models assumed (3{t) to be a sine or step function with one-year period, to 
account for mixing of children in school. The three times per year cycle seems quite curious, and 
we are not yet able to explain the biological significance of this. Our numerical simulations show 
that the dominant frequencies of f3{t) are quite robust with respect to the choice of /3(0). 

In Figure 4, we compare our recovered Bit) with the school term-based step transmission rate 

q n 

function introduced by Earn et al. [9J], Keeling and Rohani [13J. Although it is evident that the 
graphs look quite different, there are a number of qualitative similarities, such as when their annual 
maxima and minima occur. 



III. DISCUSSION 



In this paper, we provide a new algorithm to recover the time-dependent transmission rate 
from infection data. The algorithm only requires the number of infected individuals at each time 
interval, and does not require the number of susceptible individuals, as did the previously used 
method. 

The algorithm should apply to the vast majority of infection data sets, and a consequence is 
that one can nearly always construct a time-dependent transmission rate (3{t) such that SIR model 
will fit the data perfectly. This illustrates a potential danger of overfitting an epidemic model with 
time-dependent rate functions. 
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We illustrate the recovery algorithm for the SEIR model with variable vital rates, and we apply 
this algorithm to UK measles data from 1957 — 1966. The Fourier transform of our recovered 
transmission rate shows the two dominant spectral peaks have frequencies 1/2 and 3 per year, 
respectively. All previous measles transmission models assumed a one-year periodic transmission 
rate function to account for mixing of children in school. 

Using the available yearly birth data from 1957 — 1966 [23J, we recovered f3(t) over this period. 
The birth rate is almost a constant since the yearly birth data cannot show annual oscillations. 
We have verified that the conditions of the recovery algorithm are satisfied under the assumption 
of a constant birth rate (equal to the death rate, as assumed by many measles modelers) during 
the entire period pre-vaccination 1948 — 1966. 

Our algorithm has some limitations to its applicability. First, the proportion of infected indi- 
viduals, f(t), can not decrease too fast over the full time interval of interest. In general, one can 
add a sufficiently large constant to f(t) to ensure this, but this will change the range of applicable 
/3(0), and applicability needs to be checked. Second, one must assume that the proportion (or 
number) of notifications is always strictly positive. In practice this restriction can be overcome by 
replacing zero values in the time series with a very small positive value. Finally, one either needs 
to know the value of the transmission rate at some fixed time, or verify that the desired properties 
of f3{t) hold for all /3(0) in the range where it can be estimated. It may be easiest to estimate 
the transmission rate when it is near its minimum, such as during the summer vacation period of 
childhood viral diseases. 

IV. MATERIALS AND METHODS 
A. Mathematical derivation 

The recovery algorithm follows from the following complete solution of the inverse problem. 

Theorem IV. 1 Given a smooth positive function f(t), v > 0, and T > 0, there exists K > such 
that if (3(0) < K there is a solution /3(t) with /3(0) = /3(0) such that I(t) = f(t) for < t < T if 
and only if f'(t)/f(t) > -v for < t < T. 

The growth condition imposes no restrictions on how f(t) increases, but requires that f(t) 
cannot decrease too quickly, in the sense that its logarithmic derivative is always bounded below 
by —v. It is easy to see that f'(t)/f(t) > —v is a necessary condition, since Equation ([3]) implies 
that f'(t) + uf{t) = f3(t)S(t)f(t), which must be positive for < t < T. 
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The proof of the theorem consists of showing that this condition is also sufficient. We rewrite 
Equation ([3]) as 

sit) = mm • (5) 

then compute S'(t), and then equate with Equation ([2]) to obtain 



<it\ «f)/(i) ; /?(()/(() 
Calculating the derivative and simplifying the resulting expression yields the following Bernoulli 
differential equation for f3(t) 

P'(t)-p(t)f3(t)-f(t)(3 2 (t) = 0, where pit) = f'^Jfy ~ ^ • (7) 

The change of coordinates x(t) = l/(3(t) transforms this nonlinear ODE into the linear ODE 

x '(t)-p(t)x(t)-f(t) = 0. (8) 

The method of integrating factors provides the explicit solution 

_L = x (t) = x(0)e" p(i) - e~ p(t) f e p(s) f(s)ds, where P(t) = f ' p(r)dr. (9) 
P\P) Jo Jo 

A problem that could arise with this procedure is for the denominator of p(t) to be zero. A 

singular solution is prevented by requiring that the denominator be always positive, i.e., f'(t) + 

vf{t) > 0. Having done this, to ensure that f3(t) is positive, /3(0) must satisfy 

/ e p ^f(s)ds < 1/0(0). (10) 
Jo 

Mathematically, there are infinitely many choices of /3(0) and thus infinitely many transmission 
functions f3(t). In this sense the inverse problem is under-determined. This observation clearly 
illustrates a serious danger of overfitting such an epidemic transmission model. 



B. Extensions of the basic model 

Analogous results and inversion formulae hold for all standard variations of the standard SIR 
model and their combinations. The proofs are very similar to the proof of Theorem pV.ip . Here, 
we only present the full algorithm for the SEIR model with vital rates, since we apply this algorithm 
to UK measles data. 
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1. SIR model with vital rates 

S'(t) = 5 - l3(t)S(t)I(t) -5S{t), (11) 

I 1 {t) = (3(t)S(t)I(t) - ul{t) - 5I(t), (12) 

R'(t) = uI(t)-5R(t). (13) 

The necessary and sufficient condition for recovering f3(t) given v and S is f'(t)/f(t) > — (y + 5). 

2. SIR model with waning immunity 

S'{t) = mR(t) - P(t)S(t)I(t), (14) 

I'(t) = P(t)S(t)I(t)-vI(t), (15) 

R'(t) = vl{t) -mR(t), (16) 

where 1/m is the memory period of immunity. The necessary and sufficient condition for recovering 
P(t) given v is /'(*)//(*) > 



3. SIR model with time- dependent indirect transmission rate (Joh et al. fill /) 

S'(t) = -u>(t)S(t), (17) 

I'(t) = u(t)S(t) - vl(t), (18) 

R'(t) = ul(t), (19) 

where u(t) is the time-dependent indirect transmission rate. The necessary and sufficient condition 
for recovering /3(t) given v is f'(t)/f(t) > —v. 

4- SEIR model 

S'{t) = -p(t)S(t)I(t), (20) 

E'(t) = (3(t)S(t)I(t) -aE(t), (21) 

I'(t) = aE(t) - ul(t), (22) 

R'(t) = ul(t), (23) 

where 1/a is the latent period for the disease. By simple calculations, we can show that the 

necessary and sufficient condition for recovering (3(t) from infection data is f'(t)/f(t) > —v. 
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5. SEIR model with vital rates 

S'(t) = 6-P(t)S(t)I(t)-6S(t), (24) 

E'{t) = f3(t)S(t)I(t) - aE{t) - 5E(t), (25) 

I'(t) = aE{t)-vI{t)-8I{t), (26) 

R'(t) = uI(t)-6R(t). (27) 

The necessary and sufficient conditions for recovering f3(t) from infection data are 

f'(t) + (u + S)f(t) > and f"{t) + (v + 25 + a)f'{t) + {5 + a)(v + 5)f{t) > 0. (28) 

In this case, f3{t) satisfies the Bernoulli equation 

(3'+p(t)P + q(t)f3 2 = 0, (29) 

where 

-af'"{t)f(t) - a(u + 25 + a)f"{t)f(t) - a(5 + a){u + 5)f'(t)f(t) + af"{t)f'(t) + a(u + 25 + a)f'(t) 



2 



Pit) = 
+ 



af(t)[f"(t) + (u + 25 + a) f'{t) + (5 + a)(u + 5)f(t)] 
a(5 + a)(u + 5)f'(t)f(t) - 5af"(t)f(t) - 5a(u + 25 + a)f'(t)f(t) - 5a(5 + a){v + 5)f(t) 



af(t)[f"(t) + (u + 25 + a)f'(t) + (5 + a)(u + 8)f(t)\ 
and 

5a 2 f(t) - af"(t)f(t) - q(u + 25 + a)f'(t)f 2 (t) - a(5 + a)(u + 5)f(t) 
qU af(t)[f"(t) + (u + 25 + a)f'(t) + (5 + a)(u + 5)f(t)} 

The modified recovery algorithm has five steps together with three conditions. 

Step 1. Smoothly interpolate the infection data to generate a smooth function f(t) that has at 
least a continuous second derivative. Check condition 1: f'(t) + (v + 5)f(t) > 0; and check 
condition 2: /" (t) + (y + 25 + a)f'(t) + (5 + a){y + 5)f{t) > 0. 



Step 2. Compute the function p(t) 



af(t)lf"(t) + (v+25+a)f>(t)+(t+a)(v+S)f(t)\ 



, a(S+a)(u+5)f'(t)f(t)-&af''(t)f(t)~&a(u+2&+a)f'(t)f(t)-5a(&+a)(u+5)f 2 (t) 
~T~ af(t){f"(t)+(v+26+a)f'(t)+{6+a)(v+5)f(t)} 

Step 3. Choose /3(0) and compute the integral P(t) = / p{r)dT. Check condition 3: 

J o 

1 

w 



StPn 4 Tnmnnte the function n(t) - ^ ^~ af " (.t)-a(u+2S+a)f' (t)f (t)-a(8+a)(u+8)f (t) 
Step 4. Compute tne function q{t) - a f(t) [/" {t)+{u+2S+a) /' (t)+(5+a) [u+S) f{t)\ ■ 
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Step 5. Apply the formula 0(t) = 1 j e m / f3{0) + e p{£) e~ p( ^q(s)ds 
the given interval [0 T]. 



to compute j3{t) on 



If we consider the variable birth rate rj(t) and the constant death rate 5, then the SEIR model 
becomes 



S'(t) = r)(t)-(3(t)S(t)I(t)-6S(t), 

E'{t) = f3(t)S(t)I(t) - aE{t) - 5E(t), 

I'(t) = aE(t) - vl{t) - SI{t), 

R'(t) = vl{t) - 5R(t). 



(30) 
(31) 
(32) 
(33) 



In this case, the formula in Step 4 should be 



V (t)a 2 f 2 (t) - af"(t)f\t) - a{y + 25 + a)f{t)f\t) - a{5 + a){u + S)f\t) 



af(t)[f"(t) + (u + 25 + a)f'(t) + (5 + a){u + 5)f{t)] 
All other steps in the algorithm remain the same. 
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Figure 1. (a) We extract 21 equally spaced data points from the periodic function f(t) = 
10~ 5 [1.4+cos(1.5i)]; the dashed curve is /3(t) recovered from f(t) using (9). (b) These transmission 
functions are estimated using spline and trigonometric interpolations on the 21 data points. 

Figure 2. (a) We extract 21 equally spaced data points from the oscillatory decaying function 
g (t) = KT 5 [l.l + sin(i)]exp(-0.1i); the dashed curve is f5{t) is j3{t) recovered from g(t) using (9).. 
(b) These transmission functions are estimated using spline and trigonometric interpolations on 
the 21 data points. 

Figure 3. (a) Aggregated monthly measles data from England and Wales in 1948 — 1956. (b) 
The Fourier transform of filtered and smoothly interpolated aggregated monthly data showing the 
dominant frequency components (normalized modulus), (c) Birth rates during 1948—1956 reported 
from major UK cities, (d) The Fourier transform of smoothly interpolated UK birth data showing 
the dominant frequency component (normalized modulus), (e) The transmission rate function /3(t) 
obtained from the aggregated monthly data with the incorporation of a 40% correction factor (for 
account for underreporting) and historical birth rates; note that historical birth rates need to be 
normalized by the population size, (f) The Fourier transform of filtered (3{t) with historical birth 
rates showing the dominant frequency components. Note: we filter and remove the artificial peak 
at zero frequency in the Fourier transform. 

Figure 4. Compares f3{t) estimated using our recovery algorithm incorporating historical birth 
rates with the term-based step function proposed by Keeling and Rohani [ljj. The initial trans- 
mission rate in our estimation is /3(0) = 1299/year, which is the averaged transmission rate in 
January computed from the term-based step function. 
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